Abstract

Consider algorithms designed for shared memory models of parallel
computation in which processors are allowed fairly unrestricted access
patterns to the shared memory. We propose general fast simulations of
such algorithms by parallel machines in which the shared memory is
organized in modules, where only one cell of each module can be
accessed at a time.

The  paper  provides  a  comprehensive  study  of  the  problem.  The 
solution  involves  three   stages: 

(a)  Before  a    simulation,    distribute   randomly   the   memory  addresses    among   the 
memory  modules. 

(b)  Keep  several  copies    of   each  address    and   assign   memory   requests    of 
processors   to   the    'right'   copies   at   any  time. 

(c)  Satisfy   these  assigned   memory   requests    according    to   specifications    of 
the  parallel  machine. 


A preliminary version of this paper was presented at
the 9th Workshop on Graphtheoretic Concepts in Computer Science
(WG-83), Fachbereich Mathematik, Universität Osnabrück, June 1983.


I.      Introduction 

Consider algorithms designed for models of parallel computation in
which processors have access to a shared memory, and parallel machines
in which this shared memory is organized in modules where only one cell
of each module can be accessed at a time.

The problem of simulating such algorithms on these machines is the
problem of granularity of parallel memories (granularity, in short).
Every intuitive idea for coping with the granularity problem has to be
analyzed for alternate formal settings of assumptions for both the
model of parallel computation and the parallel machine. To overcome
this difficulty we present our main ideas in one setting of
assumptions, which simplifies the presentation by limiting the
discussion to the actual problem that our ideas address.
Extensions of the ideas to alternate models of computation and machines
are given later in the paper.

We study ways by which the second parallel machine model below can
simulate the first. Both machine models employ p processing elements
(PE's or processors) which operate synchronously and N common memory
cells. In the first model each processor has access to each of the N
cells in each time unit. We forbid only the case where two (or more)
processors seek access to the same cell at the same time. This is the
Exclusive-Read Exclusive-Write Parallel Random Access Machine
(EREW PRAM). It is based on [Lev, Pippenger and Valiant]. Our second
model of computation is called Module Parallel Computer (MPC). The
common memory of size N is partitioned into m memory modules. Say that
at the beginning of a cycle of this model the processors issue $R_j$
requests for addresses located in cells of module j, $0 \le j \le m-1$.
Let $R_{\max} = \max\{R_j \mid 0 \le j \le m-1\}$. Then the requests for
each module are queued in some order and satisfied one at a time. So a
cycle takes $R_{\max}$ time. We assume that immediately after the
simulation of a cycle is finished every processor knows it. Figure 1
illustrates the difference between the EREW PRAM and the MPC. The
problem of efficiently simulating one cycle of the EREW PRAM by the MPC
is taken as the definition of the granularity problem in the next two
chapters, which include the main contribution of this paper. When N = m
the MPC can simulate a cycle of the EREW PRAM in one time unit, while
when $N \ge mp$ a naive simulation may result in $R_{\max}$ as large as p.

The survey paper [Kuck] emphasizes the importance of the
granularity problem. It reports on the considerable attention this
problem has received in the literature by mentioning fourteen papers
that dealt with it. Most of these papers suggest strategies for
partitioning the memory addresses among the modules for algorithms that
either have access patterns which are known in advance or have access
patterns in successive time units which satisfy some probabilistic
assumptions. Our attitude is completely different. We present
solutions and analyses for the general problem of simulation. They do
not depend on the access behaviour of the algorithms being simulated.
This is in sharp contrast to both classes of past research mentioned
above. In this spirit [Vishkin and Wigderson] observed that in a few
general cases the idea of dynamically changing the location of addresses
among modules throughout the performance of an algorithm enables
efficient simulations utilizing only a moderate number of modules
(m=p).

Our research is motivated by the Ultracomputer project. The
NYU-Ultracomputer group ([Gottlieb et al.]) believes that a machine
using 4096 processors and 4096 memory modules will be available by
1990. The MPC actually represents an abstract Ultracomputer design
which idealizes only one point: the interconnection of processors and
memory modules. A significant part of this project involves heuristics
for the granularity problem. The Ultracomputer is a general-purpose
parallel computer that may be used for any parallel algorithm. Our
general solutions are, therefore, of particular relevance to its
design.

The present paper is similarly motivated by the parallel-design
distributed-implementation (PDDI) machine, proposed in [Vishkin 82].
The PDDI machine forms a counterpart to the Ultracomputer which differs
from it mainly in the following point. Its interconnection network,
between processors and memories, performs well in the worst case while
the interconnection network of the Ultracomputer performs well on the
average.

Part of this research can also be motivated by some data base
applications where, for instance, there are p processes
(transactions), m servers (resources, disks) and N files distributed
among the servers, so that each server may serve at most one process at
a time (a resource is locked by a transaction). The case where a
processor may require only one file at a time readily fits our
framework. However, a few alternative assumptions regarding how many
files can be required simultaneously by the same process may reflect
different circumstances. It might be interesting to investigate which
such assumptions fit our framework and which require extensions.
We do not elaborate on this motivation any more in this paper.

We provide a three stage study of the granularity problem. The
ideas of each stage can be applied separately or in conjunction with
the others. The first stage is designed to keep us 'out of trouble',
in the first place, in the average case. The key idea behind the
proposed approach is to utilize universal hashing in the simulating
machine. The MPC itself picks at random a hash function from an entire
class of hash functions, instead of using a specific hash function. This
is shown to keep memory contention low. The idea behind the second stage
is to keep several copies of each memory address in distinct memory
modules. This idea, in conjunction with fast algorithms for picking
the 'right' copy of each address request, is shown to decrease memory
contention in the worst case, for the less fortunate cases of the first
stage. Our above definition of the granularity problem made the third
stage somewhat indistinct. In simulations of models other than the
EREW PRAM by the MPC or other machines, the problem of simulation is not
completely solved by specifying for each address request the module
that satisfies it. Problems like scheduling the requests for a module
(in case queues are not available) or combining simultaneous requests
for the same address in the same module may arise. They are solved in
the third stage. Chapters II, III, and IV discuss the respective
stages.

We wish to point out a typical difficulty that we had to cope with
in all stages. Every solution we suggest is beneficial only if we
combine it with an efficient parallel algorithm. By efficient we mean
that it is fast and does not use too much local or common memory. Note
that since the worst $R_{\max}$ we want to improve is p, our algorithms
have to be significantly faster than that.

II   1.      A  Probabilistic  Simulation 

In this section we begin to study a simple probabilistic
simulation of PRAMs on MPCs. Consider a PRAM with p processing
elements and a shared memory of size N. Also consider an MPC with p
processing elements, and a shared memory of size N which is divided
into m modules. More precisely, let memory module $MM_j$, $0 \le j < m$,
contain all (physical) addresses a with $0 \le a < N$ and a mod m = j.

Our probabilistic simulation is based on universal hashing as
introduced by Carter/Wegman. Let H be a subset of $S_N$, the full set of
permutations of [0...N-1]. We use elements of H to make the connection
between logical and physical addresses. More precisely, we proceed as
follows:

Initialization: Choose $h \in H$ at random and store h in every processing
element of the MPC. The i-th PE of the MPC will run the same program
as the i-th PE of the PRAM to be simulated. We maintain the invariant
that cell h(a) of the MPC has the same content as cell a of the PRAM
for $0 \le a \le N-1$.

Step by Step Simulation:

Let $a_i$ be the (logical) address generated by the i-th PE of the
MPC. Apply h to $a_i$ and obtain the (physical) address $b_i = h(a_i)$.
Issue a request for memory cell $b_i$. This describes the behavior of the
i-th PE of the MPC, $1 \le i \le p$. Memory module $MM_j$, $0 \le j < m$,
collects all requests for cells in $MM_j$ and serves them sequentially.
When all requests are served the next cycle of the PRAM is simulated.
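The step described above can be sketched in a few lines of Python. This is only an illustration under assumed values: the function name `simulate_cycle`, the sizes N = 11 and m = 4, and the sample permutation h(x) = (3x + 5) mod 11 are hypothetical and do not come from the paper.

```python
from collections import Counter

def simulate_cycle(h, logical_addresses, m):
    """Simulate one PRAM cycle on an MPC with m modules.

    h maps logical to physical addresses; physical address b lives in
    module b mod m. Returns the queue length of every module that
    received a request and the cycle time (the longest queue).
    """
    physical = [h(a) for a in logical_addresses]
    queues = Counter(b % m for b in physical)
    cycle_time = max(queues.values(), default=0)
    return queues, cycle_time

# Hypothetical example values (not from the paper).
N, m = 11, 4
h = lambda a: (3 * a + 5) % N          # a permutation of [0..10] since 11 is prime
queues, cycle_time = simulate_cycle(h, [0, 1, 2, 3], m)
print(dict(queues), cycle_time)
```

Here two of the four requests land in module 0, so the simulated cycle takes two time units instead of one.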

Of course, the quality of the simulation described above depends
crucially on the class H of permutations used in the simulation. Note
that the simulation is probabilistic because h is chosen at random from
class H. We want

1) H to be small; because every PE needs additional local memory of
$O(\log |H|)$ bits (assuming a suitable encoding) to store an element $h \in H$.

2) random elements of H to be easy to generate; because this will
keep the cost of the initialization phase small.

3) elements $h \in H$ to be easy to evaluate; because this determines the
cost of translating from logical to physical addresses.

4) the length of the queues arising in the simulation to be short;
because they essentially determine the quality of the simulation.

We will next study expected queue length in more detail. Let
$S = \{a_1, a_2, \ldots, a_p\} \subseteq [0...N-1]$ be a set of p addresses.
Let $h \in H$ be a permutation. Define

$$R_j(h,S) = |\{a \in S;\ h(a) \bmod m = j\}|, \quad 0 \le j < m,$$

$$R_{\max}(h,S) = \max_{0 \le j < m} R_j(h,S), \text{ and}$$

$$R^p_{\max} = \max_{S \subseteq [0...N-1],\ |S|=p} \ \sum_{h \in H} R_{\max}(h,S)/|H| .$$

$R_j(h,S)$ is the length of the queue in front of memory module $MM_j$ when
permutation $h \in H$ is used and set S of addresses is issued by the
processors. $R_{\max}(h,S)$ is the length of the longest queue in front of
any memory module under the same conditions. Next,
$\sum_{h \in H} R_{\max}(h,S)/|H|$ is the expected value of $R_{\max}(h,S)$.
Finally, $R^p_{\max}$ is the worst case of that value taken with respect to
all possible sets of p addresses. In other words, $R^p_{\max}$ is the worst
case (with respect to addresses) expected (with respect to random
elements of H) length of the longest queue.
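For very small parameters these definitions can be evaluated exhaustively. The following sketch does so with exact rational arithmetic; the function names and the toy sizes N = 4, m = 2, p = 2 are illustrative assumptions, and permutations are represented as tuples (value tables).

```python
from fractions import Fraction
from itertools import combinations, permutations

def r_max(h, S, m):
    """Longest queue when the addresses in S are mapped by table h."""
    counts = [0] * m
    for a in S:
        counts[h[a] % m] += 1
    return max(counts)

def worst_case_expected_r_max(H, N, m, p):
    """Maximum over all p-subsets S of the average of r_max over the class H."""
    worst = Fraction(0)
    for S in combinations(range(N), p):
        expected = Fraction(sum(r_max(h, S, m) for h in H), len(H))
        worst = max(worst, expected)
    return worst

# Toy instance (assumed values): the full class S_N for N = 4.
N, m, p = 4, 2, 2
full_class = list(permutations(range(N)))
value = worst_case_expected_r_max(full_class, N, m, p)
print(value)
```

For this toy instance two addresses collide in a module with probability 1/3, so the worst-case expected longest queue is 4/3. The enumeration is exponential in N and is meant only to make the definitions concrete.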

We want $R^p_{\max}$ to be small. What can we expect? In order to get a
feeling for $R^p_{\max}$ we will briefly study a limiting case first:
$H = S_N$, the full set of permutations of N elements. Note that class
$S_N$ is much too large (N log N bits of local memory would be required in
every processor) to be practically useful.

Theorem 1: Let $H = S_N$ and let $m \ge p$. Then

$$R^p_{\max} \le \min(\log m/\log(m/p),\ \log m/\log\log m) + 1 .$$

Proof: Let $S = \{a_1, \ldots, a_p\} \subseteq [0...N-1]$ be arbitrary. Let
$p_{k,j}$ be the probability that at least k elements of S are mapped into
memory module j by a random element $h \in S_N$. Then
$p_{k,j} \le \binom{p}{k}(1/m)^k$ since images of different addresses are
independent and uniformly distributed. Let $p_k$ be the probability that
at least k elements of S are mapped into some memory module. Then

$$p_k \le p_{k,0} + p_{k,1} + \ldots + p_{k,m-1} \le \binom{p}{k}(1/m)^{k-1} .$$

Hence

$$R^p_{\max} = \sum_{k \ge 1} p_k \le \sum_{k \ge 1} \min\left(1, \binom{p}{k}(1/m)^{k-1}\right) \le \sum_{k \ge 1} \min\left(1, (m/k!)(p/m)^k\right) .$$

Since $(p/m)^k (m/k!) \le 1$ for $k \ge \min(\log m/\log(m/p), \log m/\log\log m)$
and decreases exponentially as a function of k for larger k, we have

$$R^p_{\max} \le \min(\log m/\log(m/p),\ \log m/\log\log m) + 1 . \quad ∎$$
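The sum that appears in the proof is easy to evaluate numerically, which gives a feeling for how quickly the terms fall below 1. The helper below is a sketch with an assumed name (`theorem1_sum`); it computes the middle expression of the chain of inequalities exactly, and makes no claim about the closed-form bound.

```python
from fractions import Fraction
from math import comb

def theorem1_sum(p, m):
    """Evaluate sum over k >= 1 of min(1, C(p,k) * (1/m)^(k-1)).

    Terms with k > p vanish because C(p,k) = 0, so the sum is finite.
    """
    total = Fraction(0)
    for k in range(1, p + 1):
        term = Fraction(comb(p, k), m ** (k - 1))
        total += min(Fraction(1), term)
    return total

# Illustrative parameter choices (assumptions, not taken from the paper).
for p, m in [(4, 4), (64, 64), (64, 64 ** 2)]:
    print(p, m, float(theorem1_sum(p, m)))
```

Increasing m relative to p makes the terms drop below 1 earlier, which is the effect the theorem quantifies.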


We infer from Theorem 1 that $R^p_{\max} \le 1 + \log m/\log\log m$ if $m \ge p$
and $H = S_N$, and that $R^p_{\max} \le 2 + 1/\varepsilon$ if $m = p^{1+\varepsilon}$
and $H = S_N$. Unfortunately, $S_N$ is too large for our purposes. However,
close inspection of the proof of Theorem 1 suggests methods for finding
smaller classes H which yield essentially the same value of $R^p_{\max}$
as $H = S_N$. The diagram below shows a plot of $p_k$ as a function of k.
Our bound on $p_k$ decreases exponentially for
$k \ge L := \min(\log m/\log(m/p), \log m/\log\log m)$. In particular
$p_k \le 1/p$ for values of k exceeding L', where L' is only slightly
larger than L. Quantity $R^p_{\max}$ is the area under the curve $p_k$. The
following simple observation is crucial for the sequel: quantity
$R^p_{\max}$ increases by at most one if $p_k$ were equal to 1/p for
$k \ge L'$. Thus the bound on $R^p_{\max}$ shown in Theorem 1 stays
essentially true if we only know that $p_{k,j} \le \binom{p}{k}(1/m)^k$ for
$k \le L'$ and that $p_k$ is non-increasing as a function of k. We will
explore this approach in the next section.
II 2.      On the Expected Length of the Longest Chain in Universal
Hashing.

Universal hashing was introduced by Carter/Wegman. They showed
that very small classes of hash functions suffice to obtain an expected
case behavior which is similar to the one of ordinary hashing. We show
in this section that universal hashing is also competitive with respect
to expected worst case behavior.

Def [CW]. Let $c \in R$, $k \in N$, $N \in N$, $m \in N$. A multiset
$H \subseteq \{h;\ h: [0...N-1] \to [0...m-1]\}$ is c strongly k universal if
for all $a_1, \ldots, a_k \in [0...N-1]$, pairwise distinct, and all
$b_1, \ldots, b_k \in [0...m-1]$,

$$|\{h \in H;\ h(a_i) = b_i \text{ for } 1 \le i \le k\}| \le c\,|H|/m^k .$$


As in the preceding section, let $p \in N$ and let $S \subseteq [0...N-1]$,
|S| = p. For $h \in H$ let

$$R_{\max}(h,S) = \max_{0 \le j < m} |\{a \in S;\ h(a) = j\}|,$$

and let

$$R^p_{\max} = \max_{S \subseteq [0...N-1],\ |S|=p} \ \sum_{h \in H} R_{\max}(h,S)/|H| .$$

We are now in a position to state the observation following Theorem 1
as

Theorem 2: Let H be a c strongly k universal multiset of functions from
[0...N-1] to [0...m-1]. Then

$$R^p_{\max} \le k + c\,p\,m\,(p/m)^k/k!$$

for all $p \in N$.


Proof: Let $S \subseteq [0...N-1]$, |S| = p be arbitrary. Let $p_i(S)$ be the
probability that $R_{\max}(h,S) \ge i$, i.e.
$p_i(S) = |\{h \in H;\ R_{\max}(h,S) \ge i\}|/|H|$. Then
$1 \ge p_1 \ge p_2 \ge p_3 \ge \ldots$ and

$$R^p_{\max} = \max_{S \subseteq [0...N-1],\ |S|=p} \ \sum_{i \ge 1} p_i(S) \le k + p \max_{S \subseteq [0...N-1],\ |S|=p} p_k(S) .$$

Next observe that $p_k(S) \le p_{k,0}(S) + p_{k,1}(S) + \ldots + p_{k,m-1}(S)$
where $p_{k,j}(S)$ is the probability that at least k elements of S are
mapped onto j. For fixed $a_1, a_2, \ldots, a_k \in S$ (pairwise distinct) we
have

$$|\{h \in H;\ h(a_i) = j \text{ for } 1 \le i \le k\}| \le c\,|H|/m^k$$

since H is c strongly k universal. Hence
$p_{k,j}(S) \le c\binom{p}{k}/m^k \le c(p/m)^k/k!$ and
$p_k(S) \le c\,m\,(p/m)^k/k!$ for all S. ∎
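To see how the bound of Theorem 2 behaves for concrete parameters, it can be evaluated exactly. The helper name `theorem2_bound` and the sample values below (p = 10 with m = 1000 or m = 100, and constants c = 4 and c = 8) are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import factorial

def theorem2_bound(k, c, p, m):
    """Evaluate k + c * p * m * (p/m)^k / k! exactly."""
    return k + Fraction(c) * p * m * Fraction(p, m) ** k / factorial(k)

# Assumed sample parameters: p = 10 processors.
print(theorem2_bound(2, 4, 10, 10 ** 3))   # k = 2, m = p^3
print(theorem2_bound(3, 8, 10, 10 ** 2))   # k = 3, m = p^2
print(theorem2_bound(2, 4, 10, 10 ** 2))   # k = 2 with fewer modules
```

The term p m (p/m)^k equals 1 exactly when m^(k-1) = p^(k+1), so a larger k permits a smaller number of modules for the same constant bound.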

Before we give some examples of strongly universal classes we
recall the following lemma.

Lemma (Carter/Wegman). If $H \subseteq \{h;\ h: [0...N-1] \to [0...N-1]\}$ is c
strongly k universal and $r: [0...N-1] \to [0...m-1]$ is such that
$|r^{-1}(j)| \le \lceil N/m \rceil$ for all j, $0 \le j < m$, then multiset

$$\bar{H} = \{r \circ h;\ h \in H\}$$

is $\bar{c}$ strongly k universal where $\bar{c} = (m \lceil N/m \rceil / N)^k c$.

Proof: Let $a_1, \ldots, a_k \in [0...N-1]$ be pairwise distinct and let
$b_1, \ldots, b_k \in [0...m-1]$. Then there are at most $\lceil N/m \rceil^k$
tuples $(c_1, \ldots, c_k) \in [0...N-1]^k$ with $r(c_i) = b_i$ for
$1 \le i \le k$. For every such tuple $(c_1, \ldots, c_k)$ there are at most
$c|H|/N^k$ functions $h \in H$ with $h(a_i) = c_i$ for $1 \le i \le k$.
Thus $\bar{H}$ is $\bar{c}$ strongly k universal with
$\bar{c} = (\lceil N/m \rceil/(N/m))^k c$. ∎


We will next give some examples. Applications 1 and 2 are based on the
fact that there are small doubly and triply transitive permutation
groups. A set H of permutations of set X is transitive if for all
$a, b \in X$ there is $h \in H$ such that h(a) = b. It is doubly (triply)
transitive if it contains a permutation replacing any given
ordered pair (triple) of elements in X by any other given ordered pair
(triple) of elements in X. The reader may consult [Carmichael, Chapter
VI] for a detailed discussion. Application 3 uses the fact that a
polynomial of degree k is fixed by its values at k+1 points, i.e. a
random polynomial of degree k maps a set of k+1 points into a random
set.

Application 1. Let N be a prime and let
$H_1 = \{h;\ h(x) = (ax+b) \bmod N$ for some $a, b \in [0...N-1]$, $a \ne 0\}$,
let r(x) = x mod m and let $\bar{H}_1 = \{r \circ h;\ h \in H_1\}$. Since for
every $x_1, x_2, y_1, y_2 \in [0...N-1]$, $x_1 \ne x_2$, $y_1 \ne y_2$ there is
exactly one pair $a, b \in [0...N-1]$ such that $y_1 = ax_1 + b \bmod N$
and $y_2 = ax_2 + b \bmod N$, class $H_1$ is 1 strongly 2 universal and hence
$\bar{H}_1$ is $(m\lceil N/m \rceil/N)^2$ strongly 2 universal. For $m = p^3$
we obtain $R^p_{\max} \le 2 + 4/2! = 4$; note that
$m\lceil N/m \rceil/N \le 2$. Finally, observe that $H_1$ is a set of
permutations.
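The two structural facts used in Application 1 can be checked by brute force for a small prime. The sketch below does this for N = 7, an assumed toy value; the function names are illustrative, and each function h is stored as its value table.

```python
from itertools import product

def linear_class(N):
    """All maps x -> (a*x + b) mod N with a != 0, stored as value tables."""
    return [tuple((a * x + b) % N for x in range(N))
            for a in range(1, N) for b in range(N)]

def count_interpolating(H, x1, y1, x2, y2):
    """Number of h in H with h(x1) = y1 and h(x2) = y2."""
    return sum(1 for h in H if h[x1] == y1 and h[x2] == y2)

N = 7                                   # assumed small prime for illustration
H1 = linear_class(N)

# Every member of the class is a permutation of [0..N-1].
all_permutations = all(sorted(h) == list(range(N)) for h in H1)

# For distinct x1, x2 and distinct y1, y2 exactly one h interpolates.
unique_fit = all(count_interpolating(H1, x1, y1, x2, y2) == 1
                 for x1, x2, y1, y2 in product(range(N), repeat=4)
                 if x1 != x2 and y1 != y2)
print(len(H1), all_permutations, unique_fit)
```

Hashing into m modules then amounts to reducing a table entry modulo m, that is, computing h[x] % m.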

Application 2. Let N be a prime. For $a, b, c, d \in [0...N-1]$ with
$ad - bc \ne 0 \bmod N$ define
$h_{a,b,c,d}: [0...N-1] \cup \{\infty\} \to [0...N-1] \cup \{\infty\}$ by

$$h_{a,b,c,d}(x) = \begin{cases} a/c & \text{if } x = \infty \\ \infty & \text{if } x = d/c \\ (ax-b)/(cx-d) \bmod N & \text{otherwise} \end{cases}$$

Note that division is well-defined since the integers mod N are a
field. Also the first two clauses in the definition coincide if c = 0.
It is known (a proof can be found in [Carmichael, Chapter VI]) that
$h_{a,b,c,d}$ is a permutation and that set
$H_2 = \{h_{a,b,c,d};\ a, b, c, d \in [0...N-1],\ ad - bc \ne 0\}$ is a triply
transitive group of permutations, i.e. for all $x_1, x_2, x_3$ and
$y_1, y_2, y_3 \in [0...N-1] \cup \{\infty\}$, $x_1, x_2, x_3$ and
$y_1, y_2, y_3$ pairwise distinct, there is exactly one $h \in H_2$ with
$h(x_i) = y_i$ for $1 \le i \le 3$. Thus $H_2$ is 1 strongly 3 universal and
hence $\bar{H}_2 = \{r \circ h;\ h \in H_2\}$ with r(x) = x mod m is 8 strongly
3 universal. Note that $(m\lceil N/m \rceil/N)^3 \le 8$. Thus for $m = p^2$
we have $R^p_{\max} \le 3 + 8/3! = 13/3$. Finally, we want to mention that
$h_{a,b,c,d}(x)$ can be evaluated efficiently using Euclid's algorithm
(cf. [AHU], p. 300-302). More precisely, $h_{a,b,c,d}(x)$ can be computed
in O(log N) arithmetic steps.
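A small sketch of how such a fractional linear map can be evaluated, assuming Python 3.8 or later, where `pow(x, -1, N)` returns a modular inverse. The sentinel `INF`, the function names and the toy prime N = 5 are illustrative assumptions; the brute-force check only confirms, for this one small prime, that every map is a permutation and counts the distinct maps.

```python
INF = "inf"   # stand-in for the extra point "infinity"

def mobius(a, b, c, d, x, N):
    """Evaluate x -> (a*x - b)/(c*x - d) over the integers mod prime N,
    extended by the point INF. Requires a*d - b*c != 0 mod N."""
    if x == INF:
        return INF if c == 0 else (a * pow(c, -1, N)) % N
    denominator = (c * x - d) % N
    if denominator == 0:
        return INF
    return ((a * x - b) * pow(denominator, -1, N)) % N

def mobius_class(N):
    """Distinct value tables of all maps with a*d - b*c != 0 mod N."""
    points = list(range(N)) + [INF]
    tables = set()
    for a in range(N):
        for b in range(N):
            for c in range(N):
                for d in range(N):
                    if (a * d - b * c) % N != 0:
                        tables.add(tuple(mobius(a, b, c, d, x, N)
                                         for x in points))
    return points, tables

points, tables = mobius_class(5)        # assumed toy prime N = 5
print(len(tables), all(len(set(t)) == len(points) for t in tables))
```

For N = 5 there are 6 points and 6 * 5 * 4 = 120 distinct maps, which matches the count expected of a sharply triply transitive set on 6 points.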

Application 3: Let N be a prime, let k be an integer and let
$H_3 = \{h;\ h(x) = \sum_{i=0}^{k-1} a_i x^i \bmod N$ for some
$a_i \in [0...N-1]$ and $a_i \ne 0$ for some $i \ge 1\}$ be the set of all
polynomials of degree at most k-1. Since for every $x_1, \ldots, x_k$ and
$y_1, \ldots, y_k \in [0...N-1]$, $x_1, \ldots, x_k$ pairwise different, there
is at most one non-trivial polynomial h of degree at most k-1 with
$h(x_i) = y_i$ for $1 \le i \le k$ we conclude that $H_3$ is 1 strongly k
universal. Also $\bar{H}_3 = \{r \circ h;\ h \in H_3\}$ is c strongly k
universal with $c = (m\lceil N/m \rceil/N)^k \le (1 + m/N)^k$. Thus for
$m = p^{1+2/(k-1)}$ we have

$$R^p_{\max} \le k + c\,p^{2+2/(k-1)}\,p^{-[2/(k-1)]k}/k! \le k + c/k! .$$

Another interesting choice is $k = 3 \ln p/\ln\ln p$ and m = p. Then
$R^p_{\max} \le k + c\,p^2/k! = O(\ln p/\ln\ln p)$.


Application 3 should be compared with Theorem 1. It states that if the
class of hash functions is restricted to the set of all polynomials of
degree at most $3 \ln p/\ln\ln p$ (note that there are only
$N^{3 \ln p/\ln\ln p}$ such functions) then the expected worst case behavior
is as good as if we use all hash functions (and there are N! of those).
Similarly, if $m = p^{1+\varepsilon}$ and $k = 1 + 2/\varepsilon$ for some
$\varepsilon > 0$ and if the class of hash functions is restricted to the set
of polynomials of degree at most k - 1 (note that there are only
$N^k = N^{1+2/\varepsilon}$ such functions) then the expected worst case
behavior is almost as good (except for a multiplicative factor of two)
as if we use all hash functions (and there are $N^N$ of those).
Unfortunately, class $H_3$ is not a class of permutations and hence class
$\bar{H}_3$ cannot be directly used in simulating PRAMs by MPCs.


Example 4: Example 4 is slightly more difficult to treat; it is not a
direct application of Theorem 2. Let $N = 2^n$ and $m = 2^b$. Then [0..N-1]
can be identified with the bit vectors of length n, i.e.
$[0..N-1] = \{0,1\}^n$. The bit vectors of length n form a vector space of
dimension n over the field of two elements. In a vector space we can use
linear transformations to map any set of (linearly independent) vectors
into any other set of vectors. This suggests considering
$H_4 = \{h;\ h: \{0,1\}^n \to \{0,1\}^n$ and h(x) = Mx for some invertible n by n
(0,1)-matrix M$\}$.

Here matrix multiplication is over the field of two elements. It
is important for the application in Section II 3 that $H_4$ is a set of
permutations. As before, let r(x) = x mod m and
$\bar{H}_4 = \{r \circ h;\ h \in H_4\}$.

Before we analyze the behavior of class $\bar{H}_4$ we show that elements
of $H_4$ are easy to find by a probabilistic algorithm. More precisely,
we show that a significant fraction of all (0,1)-matrices is
invertible.

Lemma 1: $|H_4| = \prod_{0 \le i \le n-1} (2^n - 2^i) \ge 2^{n^2}/e^{7/5}$

Proof: Lemma 1 is well known and can be found for example in E. Artin:
Geometric Algebra. We include the very short proof for the sake of
completeness. Choose M column by column. When columns 1 to i have
been chosen (and are linearly independent) then column i+1 must be
different from all linear combinations of columns 1 through i. Hence
there are $2^n - 2^i$ choices for column i+1. Thus

$$|H_4| = (2^n - 1)(2^n - 2) \cdots (2^n - 2^{n-1}) = 2^{n^2} \prod_{i=1}^{n} (1 - 2^{-i}) .$$

Next observe that

$$\prod_{i=1}^{n} (1 - 2^{-i}) = e^{\sum_{i=1}^{n} \ln(1 - 2^{-i})} \ge e^{-(7/5) \sum_{i=1}^{n} 2^{-i}} \ge e^{-7/5},$$

where the first inequality holds since $\ln(1-x) \ge -7x/5$ for
$0 \le x \le 1/2$. ∎


Lemma 1 shows that at least 25% of all (0,1)-matrices are invertible.
Hence invertible (0,1)-matrices can be found by taking a few random
(0,1)-matrices and checking for singularity by Gaussian elimination.
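The sampling procedure just described can be sketched as follows. The representation (each matrix row stored as an integer bitmask), the function names and the toy sizes are assumptions made for illustration; the rank routine is Gaussian elimination over the field of two elements written with XOR.

```python
import random
from itertools import product

def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    basis = []                       # reduced rows with distinct leading bits
    for row in rows:
        for b in basis:
            row = min(row, row ^ b)  # clears the leading bit of b if present
        if row:
            basis.append(row)
    return len(basis)

def random_invertible_matrix(n, rng):
    """Draw random n by n (0,1)-matrices until one is non-singular."""
    while True:
        rows = [rng.getrandbits(n) for _ in range(n)]
        if gf2_rank(rows) == n:
            return rows

def apply_matrix(rows, x):
    """Compute Mx over GF(2); row i produces bit i of the result."""
    y = 0
    for i, row in enumerate(rows):
        y |= (bin(row & x).count("1") & 1) << i
    return y

# Exhaustive count for n = 3: (8-1)(8-2)(8-4) invertible matrices out of 512.
invertible_count = sum(1 for rows in product(range(8), repeat=3)
                       if gf2_rank(rows) == 3)

M = random_invertible_matrix(4, random.Random(0))
images = sorted(apply_matrix(M, x) for x in range(16))
print(invertible_count, images == list(range(16)))
```

With m a power of two, the module of address x would then be `apply_matrix(M, x) % m`, that is, the low-order bits of Mx in this representation.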
We will next show that $\bar{H}_4$ is a good class of hash functions.

For $x_1, x_2, \ldots, x_k$ a set of vectors we write $\dim(x_1, x_2, \ldots, x_k)$
to denote the dimension of the space spanned by $x_1, \ldots, x_k$. Also, if
$x \in \{0,1\}^n$ we write $x_b$ for the b-dimensional vector consisting of
the last b components of x.

Lemma 2: Let $a_1, \ldots, a_k \in \{0,1\}^n$, pairwise different. Let
$d = \dim(a_1 - a_2, \ldots, a_{k-1} - a_k)$. Then the number of n by n
(0,1)-matrices M with $(Ma_1)_b = \ldots = (Ma_k)_b$ is at most

$$2^{n^2}/m^d .$$

Proof: If $(Ma_1)_b = \ldots = (Ma_k)_b$ then $(M(a_{i-1} - a_i))_b = 0$ for
$2 \le i \le k$. Assume w.l.o.g. that $a_1 - a_2, \ldots, a_d - a_{d+1}$ are
linearly independent. Let X be the n by d matrix whose columns are
$a_1 - a_2, \ldots, a_d - a_{d+1}$. Let $X_1$ be the first d rows of X and let
$X_2$ be the remaining n-d rows. Assume w.l.o.g. that $X_1$ is
non-singular. Let M' be the b by n matrix consisting of the b rows of M
that produce the components $(Mx)_b$. Let $M_1$ consist of the first d
columns of M' and let $M_2$ consist of the remaining n-d columns. Then

$$M_1 X_1 + M_2 X_2 = 0$$

or

$$M_1 = -M_2 X_2 X_1^{-1} .$$

Thus $M_1$ is determined by the choice of $M_2$ and hence there are at most
$2^{n^2 - bd} = 2^{n^2}/m^d$ matrices M such that
$(Ma_1)_b = \ldots = (Ma_k)_b$. Recall that $m = 2^b$. ∎


Lemma 3: Let $S = \{a_1, \ldots, a_p\} \subseteq \{0,1\}^n$. Let $t_{k,\ell}$ be
the number of subsets of S of cardinality k and dimension $\ell$. Then

$$t_{k,\ell} \le \binom{p}{\ell} \binom{2^\ell}{k-\ell} .$$

In particular, $t_{k,\ell} = 0$ for $2^\ell + \ell < k$.

Proof: Any subset of cardinality k and dimension $\ell$ can be written as a
set of $\ell$ linearly independent elements plus a set of $k-\ell$ vectors
which are linear combinations of the first $\ell$ elements. There are only
$\binom{p}{\ell}$ choices for the first set and only
$\binom{2^\ell}{k-\ell}$ choices for the second set. ∎

We are now in a position to estimate the behavior of class $\bar{H}_4$. Let
$S = \{a_1, a_2, \ldots, a_p\} \subseteq \{0,1\}^n$ be arbitrary. As above, let

$$p_k = \mathrm{Prob}(R_{\max}(h,S) \ge k) = |\{h \in H_4;\ R_{\max}(h,S) \ge k\}|/|H_4| .$$


Lemma 4: a) $p_1 \ge p_2 \ge p_3 \ge \ldots$

b) $p_k \le c\,m \sum_{\ell \ge 0} t_{k,\ell}\, m^{-\ell}$ where $c = e^{7/5}$

c) $p_k \le 2cm\,(p2^k/m)^{\log k - 1}$ for $2 \le k \le \log(m/p) - 1$


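Proof: a) obvious

b) We have

$$p_k \le \sum_{\ell \ge 0} \ \sum_{A \subseteq S,\ |A|=k,\ \dim A = \ell} |\{h \in H_4;\ h \text{ maps all points of } A \text{ into the same location}\}|/|H_4|$$

$$\le \sum_{\ell \ge 0} \ \sum_{A \subseteq S,\ |A|=k,\ \dim A = \ell} c/m^{\ell-1}$$

by Lemmas 1 and 2 and the observation that $\dim A = \ell$,
$A = \{a_1, \ldots, a_k\}$ implies
$\dim(a_1 - a_2, \ldots, a_{k-1} - a_k) \ge \ell - 1$

$$= c\,m \sum_{\ell \ge 0} t_{k,\ell}\, m^{-\ell}$$

by the definition of $t_{k,\ell}$. This proves part b).

c) For $k \ge 2$ we have $t_{k,\ell} = 0$ for $\ell < \log k - 1$. Hence by
part b) and Lemma 3,

$$p_k \le cm \sum_{\log k - 1 \le \ell \le k} \binom{p}{\ell}\binom{2^\ell}{k-\ell} m^{-\ell}$$

$$\le cm \sum_{\log k - 1 \le \ell \le k} (p2^k/m)^\ell$$

$$\le cm\,(p2^k/m)^{\log k - 1} \sum_{\ell \ge 0} (p2^k/m)^\ell$$

$$\le 2cm\,(p2^k/m)^{\log k - 1}$$

since $p2^k/m \le 1/2$ for $k \le \log(m/p) - 1$. ∎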
Lemmas 1 to 4 directly lead to

Theorem 3: Let $N = 2^n$, $m = 2^b$ and let $\bar{H}_4$ be defined as above.

a) There is a function $f: N \to N$, $f(p) = O(p^\varepsilon)$ for all
$\varepsilon > 0$, such that for $m \ge p f(p)$ we have:

$$R^p_{\max} \le 20 \log p/(\log(10 \log p))^2 + O(1)$$

b) Let $\varepsilon > 0$. For $m = p^{1+\varepsilon}$ we have:

$$R^p_{\max} \le O(1)$$


Proof:      Note   first   chat 


^Lx  =  ^      Pk  '^  M  ^  ppk, 


for  every   kp    1   <    ^i   ^    P*      From  Lemma    A,c   we    conclude    further 

ki         iodci-l 
^lax  <    1<1    +   2  crap(p2    Vm)  ^ 


where    c  =  e   ' 

a) Let $f(p) = \max(2^{2^{10}},\ 2^{20 \log p/\log(10 \log p)})$.

Then $f(p) = O(p^\varepsilon)$ for all $\varepsilon > 0$. Let $k_1 = \log f(p)/\log\log f(p)$.
Then $k_1 \le (\log f(p))/10$ and hence

$$p\,2^{k_1}/m \le p\,f(p)^{1/10}/(p\,f(p)) = f(p)^{-9/10}$$

Also $\log k_1 - 1 = \log\log f(p) - \log\log\log f(p) - 1 \ge (\log\log f(p))/2$
since $\log\log f(p) \ge 10$. Thus

$$p\,P_{k_1} \le 2\,c\,m\,p\,f(p)^{-9 \log\log f(p)/20}$$

$$\le 2\,c\,p^2 f(p)\, f(p)^{-9 \log\log f(p)/20}$$

$$\le 2\,c\,2^{2 \log p + \log f(p) - 9 \log f(p) \log\log f(p)/20}$$

$$\le 2\,c\,2^{2 \log p - 7 \log f(p) \log\log f(p)/20} \quad \text{since } \log\log f(p) \ge 10$$

$$= O(1) \quad \text{since } 7 \log f(p) \log\log f(p)/20 \ge 2 \log p$$

for sufficiently large p. Thus

$$R^p_{\max} \le \log f(p)/\log\log f(p) + O(1)$$

$$\le 20 \log p/(\log(10 \log p))^2 + O(1)$$
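b) Let $m = p^{1+\varepsilon}$. Then

$$R^p_{\max} \le k_1 + 2\,c\,p^{2+\varepsilon}\,(2^{k_1}/p^\varepsilon)^{\log k_1 - 1}$$

for all $k_1$, $1 \le k_1 \le p$. Let $k_1$ be such that $2 + \varepsilon = \varepsilon(\log k_1 - 1)$, i.e.
$k_1 = 2^{2 + 2/\varepsilon}$. Then

$$R^p_{\max} \le k_1 + 2\,c\,p^{2+\varepsilon - \varepsilon(\log k_1 - 1)}\, 2^{k_1(\log k_1 - 1)}$$

$$\le 2^{2 + 2/\varepsilon} + 2\,c\,2^{(1 + 2/\varepsilon)\,2^{2 + 2/\varepsilon}}$$

$$= O(1)$$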
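
The first inequality of the proof, $R^p_{\max} \le k_1 + p\,P_{k_1}$, can be unpacked as follows. This is only a sketch under two assumptions that the text does not restate at this point: $P_k$ is the probability that the maximum contention $R$ of a fixed request set is at least $k$, and $R \le p$ always holds because there are only p requests.

```latex
\begin{align*}
E[R] &= \sum_{k=1}^{p} \Pr(R \ge k) = \sum_{k=1}^{p} P_k \\
     &\le k_1 + \sum_{k=k_1+1}^{p} P_k
        && \text{since } P_k \le 1 \text{ for } k \le k_1 \\
     &\le k_1 + (p - k_1)\,P_{k_1}
        && \text{since } P_k \le P_{k_1} \text{ for } k \ge k_1 \\
     &\le k_1 + p\,P_{k_1}.
\end{align*}
```

The choice of $k_1$ in parts a) and b) then balances the two terms: $k_1$ itself against the tail $p\,P_{k_1}$.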


Theorem 3 states that the performance of class $H_4$ is "close" to the
performance of the full set of permutations. More precisely, if
$m = p^{1+\varepsilon}$ then $R^p_{\max} = O(1)$ in both cases. Of course, the constants
involved are dramatically different. Also $R^p_{\max} = O(\log p/\log\log p)$ can be
achieved by class $H_4$ for $m = p\,f(p)$ where $f(p)$ grows slower than any
root of p. If the full set of permutations is used then
$R^p_{\max} = O(\log p/\log\log p)$ can be achieved for $m = p$.

Why doesn't class $H_4$ behave well in the case $m = p$? The answer
comes in two parts. Firstly, the bounds derived in Lemmas 3 and 4, and
hence in Theorem 3, are not sharp. Secondly and more significantly,
class $H_4$ has a certain deficiency. Multiplying by a random invertible
matrix hashes a set of independent vectors very well; however, it does
not do that well on a set of linearly dependent vectors. A variant of
Lemma 1 can be used to show that the expected dimension of a random set
$a_1, \ldots, a_p$ of vectors in $\{0,1\}^n$ is very large, namely $n - O(1)$.
Unfortunately, this observation is of no use since we are dealing with
worst case behavior.

We will now discuss a possibility to overcome this problem. As
above, let $N = 2^n$, $m = 2^\mu$. Assume also that $q = N - c$ for some small
constant c is a prime. For $a \in [1 \ldots q-1]$ let

$$h_a: [0 \ldots 2^n - 1] \to [0 \ldots 2^n - 1]$$

be defined by

$$h_a(x) = \begin{cases} (ax) \bmod q & \text{if } x < q \\ x & \text{if } q \le x \le N-1 \end{cases}$$

Note that $h_a$ is a permutation. We conjecture


Conjecture: Let $a_1, \ldots, a_p \in \{0,1\}^n = [0 \ldots 2^n - 1]$ be arbitrary. For
$a \in [1 \ldots q-1]$ let $\dim_a$ be the dimension of the set $h_a(a_1), \ldots, h_a(a_p)$. Then

$$\sum_{1 \le a < q} \dim_a/(q-1) \ge \min(n,p) - f$$

for some small constant f. Moreover,

$$|\{a;\ 1 \le a < q \text{ and } \dim_a = \min(n,p) - f - i\}| \le q\,p^{-i}$$

for all $i \ge 1$.

Informally, the conjecture states that a random function $h_a$ turns
a set of p vectors $a_1, \ldots, a_p$ into a (nearly) independent set of vectors.
Moreover, it is very unlikely that the dimension of the resulting set
of vectors is much less than the expected dimension.
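
The quantities in the conjecture can be explored numerically on small instances. The sketch below is only illustrative: the function names are hypothetical, vectors in $\{0,1\}^n$ are represented as Python integers, and $h_a$ is assumed to act as the identity on $q \le x \le N-1$ (any permutation of that range would keep $h_a$ a permutation).

```python
def h(a, x, q):
    """h_a(x): multiply by a modulo the prime q below q, identity above (assumed)."""
    return (a * x) % q if x < q else x


def gf2_rank(vectors):
    """Dimension over GF(2) of the span of integers read as bit vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clears the leading bit of b in v if it is set
        if v:
            basis.append(v)
    return len(basis)


def average_dimension(points, q):
    """Average over a in [1..q-1] of the dimension of {h_a(x) : x in points}."""
    dims = [gf2_rank([h(a, x, q) for x in points]) for a in range(1, q)]
    return sum(dims) / len(dims)


if __name__ == "__main__":
    q = 251                            # N = 2**8 = 256 and q = N - 5 is prime
    dependent = [1, 2, 3, 4, 5, 6, 7]  # spans only 3 dimensions over GF(2)
    print(gf2_rank(dependent), average_dimension(dependent, q))
```

Comparing the rank of a deliberately dependent input set with the average rank after applying $h_a$ gives a quick empirical feel for the first inequality of the conjecture; it proves nothing about the worst case.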

Consider class

$$H_5 = \{h: \{0,1\}^n \to \{0,1\}^\mu;\ h(x) = (M \cdot h_a(x)) \bmod 2^\mu$$
for some a, $1 \le a < q$, and some invertible n by n (0,1)-matrix M$\}$.

We next show that class $H_5$ is a very good class of hash functions
provided that the conjecture is true.

Theorem 4: Let $H_5$ be defined as above and let $m \ge 2p$. If the
conjecture above is true then

$$R^p_{\max} \le k + 3\,c\,m\,p^{f+1}/k!$$


for  all  k,    f    <  k  <    min(n,p).      Here   c  =   e^^'^. 


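Proof: Let $S = \{a_1, \ldots, a_p\} \subseteq [0 \ldots N-1]$. As above, let
$P_k = \mathrm{prob}(R_{\max}(h,S) \ge k)$. Then

$$P_k = \sum_{a=1}^{q-1} |\{h \in H_4;\ h \text{ maps at least } k \text{ points of } h_a(S) \text{ into the same location}\}| \,/\, |H_5|$$

$$\le \sum_{a=1}^{q-1} \sum_{\ell \ge 0} \sum_{\substack{A \subseteq S,\ |A| = k \\ \dim h_a(A) = \ell}} |\{h \in H_4;\ h \text{ maps all points of } h_a(A) \text{ into the same location}\}| \,/\, |H_5|$$

$$\le \sum_{a=1}^{q-1} \sum_{\ell \ge 0} \sum_{\substack{A \subseteq S,\ |A| = k \\ \dim h_a(A) = \ell}} c\,m^{1-\ell}/q \qquad \text{by Lemma 2}$$

$$= \sum_{\substack{A \subseteq S \\ |A| = k}} \sum_{\ell \ge 0} |\{a;\ \dim h_a(A) = \ell\}|\; c\,m^{1-\ell}/q$$

$$\le \sum_{\substack{A \subseteq S \\ |A| = k}} \sum_{\ell = 0}^{\min(n,k)-f-1} c\,m^{1-\ell}\,p^{-(\min(n,k)-f-\ell)} + \sum_{\substack{A \subseteq S \\ |A| = k}} c\,m^{1-(\min(n,k)-f)}$$

$$\le 2\,c \binom{p}{k} m\,p^{f-k} + c \binom{p}{k} m^{1+f-k}$$

since $k \le n$ and $m \ge 2p$ by assumption

$$\le 3\,c \binom{p}{k} m\,p^{f-k} \le 3\,c\,m\,p^f/k!$$

since $m \ge 2p$ and $k \ge f$ by assumption. The proof is now completed since
$R^p_{\max} \le k + p\,P_k$ for all k. •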

Theorem 4 has an interesting consequence. Assume that $m = 2p$ and
$n > \log p$. The latter assumption is certainly realistic. Let
$k = (f+2) \log p/\log\log p$. Then $R^p_{\max} \le (f+2) \log p/\log\log p + 6$.
This is basically the same behavior as the behavior of the full class
of permutations; cf. Theorem 1.



II 3. Probabilistic Simulations Revisited.

We will now apply the results of Section II 2 to the probabilistic
simulation described in Section II 1. We assume throughout this
section that operations on addresses, like multiplying by an integer,
take unit time.

Theorem 1: Let m = p . Then a T(n) time bounded PRAM with p PE's and
N memory cells can be simulated by a randomized MPC with p PE's and m
memory modules and total memory of size N+2p in time O(T(n)).

Proof: We use the simulation as described in Section II 1 except that
permutation $\pi$ is chosen from class $H_1$. Every processor needs two
additional storage cells to store $\pi$. Also one step of the PRAM takes
expected time O(1) on the MPC. •

Theorem  2_:  Let  m  =  o"".  Then  a  T(n)  time  bounded  PRAM  with  p  PE's  and  N 
memory cells can be simulated by a randomized MPC with p PE's, m memory
modules and total memory size N+4p in time O(T(n) log N).

Proof: Replace $H_1$ by $H_2$ in the proof of Theorem 1 and observe that the
functions in $H_2$ can be evaluated in time O(log N). •

Theorem 3: a) Let m = p. Then a T(n) time bounded PRAM with p PE's and
N memory cells can be simulated by a randomized MPC with p PE's, m
memory modules and total memory size (N+p) log p in time O(T(n) log p).
b) Let $\varepsilon > 0$ and let $m = p^{1+\varepsilon}$. Then a T(n) time bounded PRAM with p
PE's and N memory cells can be simulated by a randomized MPC with p
PE's, m memory modules and total memory size $(1 + 2/\varepsilon)(N+p)$ in time
O(T(n)).

Proof: a) We use the simulation as described in Section II 1 except
that $\pi$ is chosen from class $H_3$ with k = 3 log p/log log p. Then every
processor requires O(log p) cells to store $\pi$. There is one additional
problem now: class $H_3$ is not a class of permutations. Rather, $\pi \in H_3$ might
map up to log p/log log p distinct points into the same cell. It can
certainly map no more distinct points into the same cell since $\pi \in H_3$
is a nontrivial polynomial of degree at most log p/log log p. Therefore
the simulating MPC has log p/log log p copies of every memory cell. A
memory access is made by sending the original PRAM address a and the
modified address $\pi(a)$ to the appropriate memory module. We can then
build up a balanced tree (say) for all addresses which are mapped to
the same address by $\pi$. Thus access time within a memory module might
be as large as O(log log p). Also the expected number of concurrent
accesses to the same module is O(log p/log log p) by Section II 2,
application 3 and hence it takes an expected number of O(log p)
MPC-steps to simulate one PRAM step. Also evaluation of a hash
function in $H_3$ takes O(log p) steps.

b) The proof is completely analogous to the proof of part a). We
choose $k = 1 + 2/\varepsilon$. Then at most $2/\varepsilon$ distinct points are mapped into
the same cell. Thus total memory size is $(1 + 2/\varepsilon)(N + p)$ and it
takes an expected number of $O(2/\varepsilon \log 2/\varepsilon)$ MPC steps to simulate one
PRAM step. •


Theorem 4: a) There is a function $f: \mathbb{N} \to \mathbb{R}$ with $f(p) = O(p^\varepsilon)$ for all
$\varepsilon > 0$ such that: if $N = 2^n$, $m = 2^\mu \ge p\,f(p)$ then a T(n) time bounded
PRAM with p PE's and memory size N can be simulated on a randomized MPC
with p PE's, m memory modules and total memory size N + p log N in time
$O((\log N)^3 + T(n)(\log p/(\log\log p)^2 + \log N))$.

b) Let $\varepsilon > 0$ be fixed. If $m = p^{1+\varepsilon}$ then the time bound reduces to
$O((\log N)^3 + T(n) \log N)$.

Proof: Replace $H_1$ by $H_4$ in the proof of Theorem 1 and observe that an
element of $H_4$ can be stored in log N words of length (log N) each.
Thus every processor needs log N additional memory cells. A random
element of $H_4$ can be chosen as follows. Generate log N random
bitstrings of length log N each (in time $O((\log N)^2)$) and check whether
the (0-1) matrix M generated in this way is invertible. This takes
time $O((\log N)^3)$ on a sequential machine. Also O(1) tries suffice on
the average. Thus choosing a random element of $H_4$ takes time
$O((\log N)^3)$. Next observe that it takes time O(log N) to multiply a
log N by log N matrix by a vector. The time bound is now an easy
consequence of Theorem 3 of Section II 2. •
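
The sampling procedure in the proof (draw random bit rows, retry until the matrix is invertible) can be sketched as follows. The helper names are hypothetical, each matrix row is packed into one Python integer to mirror the "log N words of length log N" storage, and the final reduction to the low $\mu$ bits is an assumption about how a matrix hash is turned into a module number.

```python
import random


def gf2_rank(rows):
    """Rank over GF(2) of a matrix given as a list of row bitmasks."""
    basis = []
    for v in rows:
        for b in basis:
            v = min(v, v ^ b)  # clears the leading bit of b in v if it is set
        if v:
            basis.append(v)
    return len(basis)


def random_invertible_matrix(n, rng):
    """Draw n random n-bit rows until they are linearly independent."""
    while True:
        rows = [rng.getrandbits(n) for _ in range(n)]
        if gf2_rank(rows) == n:
            return rows


def mat_vec(rows, x):
    """Multiply the matrix by the bit vector x over GF(2)."""
    y = 0
    for i, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1  # inner product mod 2
        y |= parity << i
    return y


def module_of(rows, x, mu):
    """Assumed hash: keep the low mu bits of M*x, i.e. (M*x) mod 2**mu."""
    return mat_vec(rows, x) & ((1 << mu) - 1)
```

Because an invertible matrix permutes the address space, each of the $2^\mu$ modules receives exactly $N/2^\mu$ addresses under `module_of`; the quality of the hash concerns how an adversarial set of p requests is spread, which is what Theorem 3 of Section II 2 bounds.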

We do not feel that theorems 1 to 4 are best possible. They
complement each other in that they optimize different parameters of the
problem. Theorems 1, 2 and 4 present solutions with only a very
moderate increase in total memory size (O(1) additional cells per
processor in theorems 1 and 2 and O(log N) in Theorem 4) and only a
moderate increase in running time (a multiplicative factor of O(1) in
Theorem 1 and O(log N) in theorems 2 and 4). However, all three
theorems require a non-linear number of memory modules, thus increasing
the size of the interconnection network. Theorem 4a is our best result
in that respect.

On the other hand, both parts of Theorem 3 provide us with
solutions with a small number of memory modules (p in part a and $p^{1+\varepsilon}$
in part b) and modest increases of running time (O(log p) in part a
and O(1) in part b). However, both parts force us to increase total
memory size considerably.

Theorem 4 of the previous section has the potential (if the
conjecture were true) of combining most advantages of the other
schemes. With only two memory modules per processor and only log N
additional memory cells per processor it achieves a slowdown of
O(log p/log log p).


III.      Efficient   Deterministic   Simulations 

Say that each of the N addresses is contained in exactly c of the
m memory modules for some integer c (c > 1). Each such distribution of
addresses is called a c-partitioning.

We may break each cycle of the EREW PRAM into two halves: one
includes all read instructions from the common memory, while the other
includes all the write instructions of the cycle. This enables us to
classify the cycles into reading cycles and writing cycles without
multiplying the running time by more than a factor of two. We argue,
in the paper, that the worst case time for simulating writing cycles
worsens only a little while the worst case time for simulating reading
cycles improves a lot. Many simulations of programs for the
Ultracomputer have been run. The ratio between read instructions and
write instructions that relate to the common memory was around 8:1.
This fact is important since in case a store instruction into the
common memory is executed we need access to all copies of the memory
address, while in case a fetch instruction is executed one of the
copies would suffice. See more on writing cycles in the last chapter.
Our main concern is to study the advantages and limitations of the
copying approach for simulating reading cycles. Our analysis applies
also to input addresses (which are only read).

We start this chapter by studying some limitations of the copying
approach. Then, a specific c-partitioning is proposed. For this
c-partitioning we give upper bounds for the optimal $R_{\max}$ achievable.
We include also two sections that discuss how to compute efficiently
good assignments of address requests to modules. The first of these
sections deals with polynomial-time sequential algorithms for computing
optimal assignments, namely with minimum $R_{\max}$. The second section
suggests fast parallel algorithms that give low (but not necessarily
minimum) $R_{\max}$. For the purpose of fast simulation we only need
algorithms of the second kind. The proposed c-partitioning is very
efficient when this stage is applied separately. However, an
alternative c-partitioning which combines well with the first stage is
considered subsequently. The last section of this chapter demonstrates
that a number of copies c is useful when $\binom{m}{c}$ is much larger than N.

Note that the number of simultaneous requests for memory addresses
is $\le p$. In order to use one parameter less, and thereby simplify the
presentation, we consider the most difficult case only; i.e., where
this number is p. It will be straightforward to extend our results to
cases where this number is $< p$.

III 1. Lower Bounds.

Assume that some c-partitioning is given. Problem: Find a lower
bound for the optimal worst case time delay ($R_{\max}$) for any p reading
requests for memory addresses.

Let $1 \le i_1 < i_2 < \cdots < i_k \le m$ be k modules and $A_{i_1,i_2,\ldots,i_k}$ be the set
of all memory addresses such that all their c copies are contained in
memory modules $i_1, i_2, \ldots, i_k$. We are looking for a (very unfortunate)
set of p memory addresses such that all their copies are contained in a
minimum number of modules. More formally, we are looking for a minimum
size subset of modules $\{i_1, i_2, \ldots, i_k\}$ such that $|A_{i_1,i_2,\ldots,i_k}| \ge p$.

Claim. The equation $\sum |A_{i_1,i_2,\ldots,i_k}| = \binom{m-c}{k-c} N$ holds for any k,
$c \le k \le m$. The summation on the left hand side is on all $\binom{m}{k}$ subsets of k
modules. The right hand side gives an alternative evaluation which is
based on the fact that each memory address is contained in c modules;
fixing k-c additional modules (in any of the $\binom{m-c}{k-c}$ possibilities) gives
a subset of k modules containing this address. This completes the
proof of the claim.
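
The double counting in the claim can be checked by brute force on small instances. The sketch below is illustrative only; `placement` is a hypothetical list giving, for every address, the c modules (numbered from 0) that hold its copies.

```python
from itertools import combinations
from math import comb


def count_both_sides(m, c, k, placement):
    """Return (left side, right side) of the claim for one c-partitioning."""
    assert all(len(mods) == c for mods in placement)
    lhs = 0
    for subset in combinations(range(m), k):
        chosen = set(subset)
        # |A_subset|: addresses whose c copies all lie inside the k modules
        lhs += sum(1 for mods in placement if set(mods) <= chosen)
    rhs = comb(m - c, k - c) * len(placement)
    return lhs, rhs
```

The two sides agree for every c-partitioning, not only for the one proposed later, which is why the lower bound that follows holds in general.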

Hence, there is at least one subset of k modules that contains at least

$$\binom{m-c}{k-c} N \Big/ \binom{m}{k} = k \cdots (k-(c-1))\,N/(m \cdots (m-(c-1)))$$

elements. We are looking for the minimum k such that

(1) $p \le k \cdots (k-(c-1))\,N/(m \cdots (m-(c-1)))$.

Denote this k by K. This implies that

OBSERVATION 1. $R_{\max} \ge p/K$.

If $N = \binom{m}{c} x$ for some integer x, then from inequality (1) we get
$p \le k \cdots (k-(c-1))\,x/c!$ or $p \le \binom{k}{c} x$. This implies

OBSERVATION 2. For $N = \binom{m}{c} x$, $R_{\max} \ge p/K$ where K is the smallest
k satisfying $p \le \binom{k}{c} x$.

It is shown later that the last lower bound meets exactly an upper
bound on $R_{\max}$ for a specific c-partitioning that we propose.
Therefore, we delay presentation of explicit evaluations until this
later discussion.
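
Observation 2 is easy to evaluate numerically. The following sketch assumes $N = \binom{m}{c} x$ exactly; the function names are hypothetical, and rounding $p/K$ up is an addition justified only by the fact that $R_{\max}$ is an integer.

```python
from math import comb


def smallest_k(p, c, x, m):
    """Smallest k with c <= k <= m and p <= C(k, c) * x, or None if none exists."""
    for k in range(c, m + 1):
        if p <= comb(k, c) * x:
            return k
    return None


def contention_lower_bound(p, c, x, m):
    """Lower bound ceil(p / K) on the worst case R_max from Observation 2."""
    k = smallest_k(p, c, x, m)
    return None if k is None else -(-p // k)  # integer ceiling of p / K
```

For example, with p = 1024 requests, c = 2 copies, x = 1 and m = 1024 modules the smallest k is 46, so some request set forces a contention of at least 23 under any 2-partitioning.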


III 2. The Proposed c-partitioning

Our suggested c-partitioning is simple. For $N \le \binom{m}{c}$ and an address i,
$0 \le i < N$, take the i-th subset of c modules (out of m in the
lexicographic order) and put a copy of address i in each of the c
modules. If $(x-1)\binom{m}{c} < N \le x\binom{m}{c}$, for some integer x, then partition
the addresses into x approximately equal subsets (layers) and fix the
c-partitioning of each of the x layers separately as for $N \le \binom{m}{c}$.

Remark: For an address i it takes $O(c^2)$ time to compute the subset
of c modules containing its copies. Clearly, $i' = i \pmod{\lceil N/\binom{m}{c} \rceil}$ is
the serial number of i in its layer. The minimum $i_1$ ($\ge c$) such that
$\binom{i_1}{c} > i'$ implies that module number $i_1 - 1$ contains a copy of address i.
The computation of $i_1$ takes O(c) time. This is because we get a constant
difference approximation to $i_1$ by Stirling formula and explicit
presentation of $i_1$. Denote $i'' = i' - \binom{i_1 - 1}{c}$. The minimum $i_2$ ($\ge c-1$) such
that $\binom{i_2}{c-1} > i''$ implies that module number $i_2 - 1$ contains a copy of
address i. This takes O(c-1) time; and so on.
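
The procedure of the remark can be sketched directly. The function below is a hypothetical illustration: it takes the serial number of an address within its layer, numbers modules from 0, and finds each $i_j$ by a plain linear search rather than by the Stirling-based approximation mentioned above, so it does not meet the stated time bound.

```python
from math import comb


def modules_for_address(serial, c):
    """Modules (0-based, largest first) holding the copies of one address.

    serial is the serial number of the address within its layer.
    """
    modules = []
    rest = serial
    for size in range(c, 0, -1):
        i = size                      # smallest candidate, as in the remark
        while comb(i, size) <= rest:  # minimum i with C(i, size) > rest
            i += 1
        modules.append(i - 1)
        rest -= comb(i - 1, size)
    return modules
```

Read this way, the remark enumerates c-subsets so that subsets with a smaller largest module come first; with m modules the serial numbers 0 to $\binom{m}{c} - 1$ produce every c-subset exactly once.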


III 3. Upper Bounds

Theorem 1. Let t be an integer, $N = x\binom{m}{c}$, $S_c = \min\{s \mid \binom{s}{c} x \ge (t-1)s+1\}$
and $r = S_c(t-1)+1$. If $S_c \le m$ then: (a) There exist r address requests
which cause memory contention of at least t ($R_{\max} \ge t$) for any assignment
of requests to copies. (b) For any r-1 requests, however, it is possible to
get $R_{\max} \le t-1$.

Proof: We show (b) first. Say, in contradiction, that a set of r-1
requests is given such that the best $R_{\max}$ obtainable is t. (It will be
easy to modify our argument if the best $R_{\max}$ that can be obtained is
greater than t.) Among all possible ways to partition requests among
our modules which imply $R_{\max} = t$, choose those that assign a minimum number
of modules with exactly t requests. Then, restrict further the choice
to a partition where a module with the smallest serial number possible
is assigned with t requests. So we have t requests for addresses in
some module. All these addresses have other copies in other modules.
Each of these modules is assigned with at least t-1 requests(!). Let
us denote a lower bound on the number of modules containing all copies
of these t addresses by $S_1$; then we have, so far, requests for at least
$(t-1)S_1+1$ addresses. Denote by $S_2$ a lower bound on the number of
modules containing all copies of the $(t-1)S_1+1$ addresses. It should be
clear how to define $S_3, S_4, \ldots$ (note that we actually choose $S_1 - 1$,
$S_2 - S_1$, $S_3 - S_2, \ldots$ modules in the corresponding steps of this process).

Claim 1. Each module which enters the scene is assigned with at least
t-1 requests.

Proof. Our sequence satisfies the following. There exists a directed
path of the following form from the first module (the one we started
with, which is assigned with t requests) to each module which is
counted by the $S_i$ numbers; the nodes along the path are modules; and
there is a directed edge from module A to module B if module A is
assigned with some address request and another copy of this address is
located in module B. By propagating requests along this path we may
decrease the number of requests assigned to the first module by one,
increase the number of requests assigned to the last module on the path
by one and not change the number of requests assigned to any other
module. Thus, the existence of a module as in Claim 1 which is
assigned with less than t-1 requests contradicts the choice of the
partitioning of requests among modules above.

$S_1 = \min\{s \mid \binom{s}{c} x \ge t\}$, $S_{i+1} = \min\{s \mid \binom{s}{c} x \ge S_i(t-1)+1\}$, for $i \ge 1$.

Claim 2. The sequence (of integers) $\{S_i\}$ converges to $S_c$.

Proof. Recall $S_c \le m$. If $S_i < S_c$ then $S_{i+1}$ satisfies $S_i < S_{i+1} \le S_c$.
If $S_i = S_c$ then $S_{i+1} = S_c$.

So we could not start with less than $S_c(t-1)+1$ requests. A
contradiction.

Proving (a) is easy. Take r ($= S_c(t-1)+1$) requests that all their
copies are in $S_c$ modules. This completes the proof of Theorem 1.

The Connection with the lower bound. We showed that for any r address
requests the smallest number of modules to contain their copies is the
smallest k satisfying $r \le \binom{k}{c} x$, which is exactly our $S_c$. Then we
concluded that $R_{\max}$ is at least the smallest integer satisfying
$R_{\max} \ge r/S_c$. Here this integer is t, which is exactly the best $R_{\max}$
achievable. So, the upper bound and the lower bound are exactly equal.

For any p requests let us establish an explicit upper bound on the
best $R_{\max}$ achievable. Denote this $R_{\max}$ by t. Let x be the integer such
that

$$(x-1)\binom{m}{c} < N \le x\binom{m}{c}.$$

Similar to the proof of Theorem 1 we restrict the choice of the
assignments of requests to modules. Then, we start with a module which
is assigned with t requests. Now, s is the minimum number of modules
that have to be assigned with at least t-1 requests in order not to
enable us to decrease t in the first module. The following inequality
holds for reasons similar to the proof of Theorem 1: $x\binom{s}{c} \ge (t-1)s+1$.
It implies

$$x\,s^c/c! \ge (t-1)s \;\Rightarrow\; s \ge ((t-1)\,c!/x)^{1/(c-1)}$$

Together with the fact $p \ge (t-1)s+1$ we get

$$(t-1)\,((t-1)\,c!/x)^{1/(c-1)} \le p \quad \text{implying}$$

$$(t-1)^{c/(c-1)} \le p/(c!/x)^{1/(c-1)} \;\Rightarrow\;$$

$$(t-1)^c \le x\,p^{c-1}/c! \quad \text{and}$$

$$t \le 1 + (x\,p^{c-1}/c!)^{1/c}.$$

Thus,

Theorem 2. $R_{\max} \le 1 + ((p^{c-1}/c!)\,\lceil N/\binom{m}{c} \rceil)^{1/c}$. ($\lceil x \rceil$ denotes
the smallest integer i such that $i \ge x$.)

III 4. Optimal Assignment.

The problem of optimal assignment to the 'right' copy of each
requested memory address, in order to minimize $R_{\max}$, is solvable in
polynomial time. We actually solve the problem for any distribution of
copies among modules (not only c-partitionings), in Figure ^. For more
on the Max-Flow problem, see [Even].
An alternative solution.

We have to assign p requests to modules. Assign one request at a
time. Say that i out of the p address requests, for some $0 \le i < p$,
got already (temporary) assignment to modules. Assume that these i
requests are assigned to modules in a way which minimizes $R_{\max}$ with
respect to them. Denote this $R_{\max}$ by $R_{\max,i}$. We show how to extend
this assignment to another temporary assignment for one more request,
in a way which achieves minimum $R_{\max}$ with respect to the i+1 requests.
Our algorithm and proof are similar to the proof of Theorem 1. Consider
the following auxiliary directed graph. Nodes represent modules.
There is an edge from module A to module B if module A is assigned with
a request for some address a while B contains another copy of a. The
algorithm for extending the assignment of i requests to an assignment
of i+1 requests is as follows.

(1) Assign the (i+1)-st address request to a module $M_1$ containing one
of the copies of the address. (Add to the auxiliary digraph c-1 edges
from this module to the other modules that contain copies of this
address.)

(2) If $M_1$ is now assigned with $\le R_{\max,i}$ requests then we are done.
(It is impossible to add requests and decrease $R_{\max}$.)

(3) If $M_1$ is now assigned with $R_{\max,i}+1$ requests, search the
auxiliary digraph for a module which is assigned with less than $R_{\max,i}$
requests and is reachable through a directed path from $M_1$. (Any
efficient search can be utilized here, for instance Breadth First
Search.) If such a module is found then propagate request assignments
along a path in the digraph from $M_1$ to this module. (This results in
the following change: the number of requests assigned to this module
increases by one. Note, however, that it is still $\le R_{\max,i}$. The
number of requests assigned to all other modules is exactly the same as
it was for the i requests.)

(4) If no such module is found in (3) do nothing (the request is assigned
to $M_1$).

The only thing that has to be proved is that step (4) is correct.
In other words: why is it impossible to assign these i+1 requests with
$R_{\max} = R_{\max,i}$ rather than $R_{\max} = R_{\max,i}+1$? Assume, in contradiction,
that there is an assignment of these i+1 requests with $R_{\max} = R_{\max,i}$.
In this contradictory assignment $M_1$ is assigned with $\le R_{\max,i}$ requests.
Therefore, there exists a request for some address a, which is assigned
to $M_1$ in our assignment (the result of step (4)) but not in the
contradictory assignment. Hence, the auxiliary graph contains an edge
from $M_1$ to $M_2$, the node representing the module to which the request
for a is assigned in the contradictory assignment. By step (3), $M_2$ has
$R_{\max,i}$ requests assigned to it in our assignment. Define S, an
auxiliary set of modules, to include presently $M_1$ and $M_2$. In our
assignment both modules of S have together $2R_{\max,i}+1$ requests assigned
to them while in the contradictory assignment there are at most
$2R_{\max,i}$ such requests. Therefore, there exists a request for some
address $a_2$ which is assigned to a module of S in our assignment but not
in the contradictory assignment where this request is assigned to
another module not in S, say $M_3$. This implies the existence of an edge
in the auxiliary graph from this module in S to $M_3$. By step (3) $M_3$ has
$R_{\max,i}$ requests assigned to it in our assignment. Add $M_3$ to S.
Similarly we show that the set S can grow infinitely large. The number
of modules is finite, so we got a contradiction. Therefore, the
assignment achieved by our algorithm yields the minimum $R_{\max}$.
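
Steps (1) to (4) can be sketched in a few lines. This is an illustrative rendering, not the authors' program: the function name and the data layout (`copies[r]` lists the modules holding a copy of the address of request r) are assumptions, the first listed module is always taken as $M_1$, and the auxiliary digraph is explored implicitly instead of being stored.

```python
from collections import deque


def assign_requests(copies):
    """Incrementally assign requests to modules; return (assignment, r_max)."""
    assignment = []   # assignment[r] = module currently serving request r
    load = {}         # module -> number of requests assigned to it
    served = {}       # module -> set of requests assigned to it
    r_max = 0         # optimal contention for the requests handled so far
    for r, mods in enumerate(copies):
        m1 = mods[0]                                  # step (1)
        assignment.append(m1)
        load[m1] = load.get(m1, 0) + 1
        served.setdefault(m1, set()).add(r)
        if load[m1] <= r_max:                         # step (2)
            continue
        parent = {m1: None}                           # step (3): breadth first search
        queue = deque([m1])
        target = None
        while queue and target is None:
            a = queue.popleft()
            for req in served.get(a, ()):
                for b in copies[req]:                 # edge a -> b via req
                    if b not in parent:
                        parent[b] = (a, req)
                        if load.get(b, 0) < r_max:
                            target = b
                            break
                        queue.append(b)
                if target is not None:
                    break
        if target is None:                            # step (4)
            r_max += 1
            continue
        b = target                                    # propagate along the path
        while parent[b] is not None:
            a, req = parent[b]
            served[a].remove(req)
            served.setdefault(b, set()).add(req)
            assignment[req] = b
            load[a] -= 1
            load[b] = load.get(b, 0) + 1
            b = a
    return assignment, r_max
```

Each new request costs one graph search, so p requests take at most p searches; the returned `r_max` equals the largest module load of the returned assignment.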

III 5. Fast Parallel Approximation Algorithms.

The above mentioned optimal assignment algorithms are of general
theoretical interest and might be relevant for data-base applications
as mentioned in the introduction. However, the simulation requires
fast parallel algorithms for the assignment problem even if the optimal
result is not achieved (remember that we are after fast simulation of
one EREW PRAM cycle). This poses a challenge of a new kind.

Lemma. Let $N \le \binom{m}{c}$ be the number of addresses. For p address requests
we can achieve $R_{\max} \le c\,p^{1-1/c}$ in parallel time O(c + c log p). For c = 2,
we can do it in constant parallel time.

Remark.      The  claim   for  c»2    needs   a    little   bit   stronger  assumptions   about   the 
MPC;    like,    a  memory  request   can   be   dropped  by  the   requesting   processor, 
or,    alternatively,    a  memory  module  can   discard   memory  requests   sent      to 
it   if  their   location   in   the   module's   queue  exceeds  a   certain  point. 
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Proof    .      By  induction  on  c. 

For c = 2 start with any assignment of address requests to copies. For
all modules where the number of requests is $> \sqrt{p}$ (there exist
$< \sqrt{p}$ such modules) switch all requests above the $\sqrt{p}$ line
to the other copy. (By doing that for all requests that were not
responded to in the first $\sqrt{p}$ time units of the cycle being
simulated, we avoid computation of partial sums for c = 2.) Second copies
of the requests switched from one module belong to pairwise distinct
modules. Therefore, each module gets no more than $\sqrt{p}$ requests for
second copies.

Now we are ready for the inductive step. For c copies, cut in the first
copy over $p^{(c-1)/c}$. Get (at most $p^{1/c}$) sets of address requests
to be switched to another copy. The other c-1 copies for each set relate
like c-1 copies of a subset of $\binom{m-1}{c-1}$ addresses.

The somewhat tedious exact implementation details are left to the
interested reader. The following outline will be helpful for this. The
decision on where the $p^{(c-1)/c}$ line lies is based on scheduling the
requests for the modules in a way which utilizes both sorting and partial
sums and which is described in detail in the next chapter for a similar
purpose. The scheduling of requests to second copies is done separately
in each set of requests that were switched from the same module, and so
is the scheduling for later copies, which is done separately for even
more refined sets. The definition of such a refined set (at each
transition from i to i-1 copies and application of the inductive step)
includes all address requests that passed through the same sequence of
modules but have not yet been satisfied. The partial sums computation for
such a set meets exactly the partial sums computation scheme described in
the next chapter. (Here, we do not need the assumptions of the remark for
c = 2 since switched requests are not sent in the first place.)
Therefore, we can apply the induction.

Let $A_i$ be the sizes of the switched sets; then

$$\sum_i (c-1)\,A_i^{(c-2)/(c-1)} \le (c-1)\,p^{1/c}\left(p^{(c-1)/c}\right)^{(c-2)/(c-1)} = (c-1)\,p^{(c-1)/c}.$$


The inequality holds because the left hand side is maximized when all
the $A_i$ numbers are equal. The $c^2$ size is due to the need to compute
the subset of modules that contain a copy of an address by each processor
using its local memory. The $\log^2 p$ is due to computations of partial
sums, which are required at each of the c steps of the induction.
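
The c = 2 base case of the proof can be sketched sequentially as follows.
This is only an illustration under stated assumptions: the function names
are hypothetical, each request is given as a pair (first copy's module,
second copy's module), and distinct requests are assumed to have distinct
unordered pairs of modules, so that requests switched from one module go
to pairwise distinct modules.

```python
import math
from collections import defaultdict


def assign_two_copies(requests):
    """Pick a module for each request (hypothetical helper).

    requests: list of (first_module, second_module) pairs, assumed to be
    distinct as unordered pairs. A module keeps at most floor(sqrt(p))
    requests for first copies; requests above that line are switched to
    their second copy.
    """
    p = len(requests)
    line = math.isqrt(p)          # the "sqrt(p) line" of the proof
    kept = defaultdict(int)       # first-copy requests kept per module
    chosen = []
    for first, second in requests:
        if kept[first] < line:
            kept[first] += 1
            chosen.append(first)
        else:
            chosen.append(second)  # switched to the other copy
    return chosen


def max_load(chosen):
    """Largest number of requests assigned to a single module."""
    load = defaultdict(int)
    for module in chosen:
        load[module] += 1
    return max(load.values(), default=0)
```

For example, with all 15 pairs over 6 modules as requests, the line is 3
and no module ends up with more than 2 * 3 requests, in agreement with
the bound $c\,p^{(c-1)/c}$ for c = 2.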

Theorem 3. We can achieve
$R_{\max} \le c\left(p^{c-1}\lceil N/\binom{m}{c}\rceil\right)^{1/c}$
in parallel time $O(c^2 + c \log^2 p)$.

Proof. Let x be the integer such that
$(x-1)\binom{m}{c} < N \le x\binom{m}{c}$. Apply the lemma for each of
the x layers, separately. Let $p_i$ be the number of requests for
addresses in the i-th layer, $1 \le i \le x$. Then we get

$$R_{\max} \le \sum_{i=1}^{x} R_{\max}(i) \le \sum_{i=1}^{x} c\,p_i^{(c-1)/c} \le x\,c\,(p/x)^{(c-1)/c} = c\left(p^{c-1}x\right)^{1/c} = c\left(p^{c-1}\lceil N/\tbinom{m}{c}\rceil\right)^{1/c}.$$

$R_{\max}(i)$ is the $R_{\max}$ obtained for the i-th layer. The first
inequality is obvious. The second is implied by the lemma. The third
holds because its left hand side is maximized when all the $p_i$ are
equal. The scheme of partial sums computation of the next chapter should
be used here as well.

III    6.      Connection with the First Stage.

The c-partitioning proposed above poses some difficulties in combining
it with the probabilistic simulation of the previous chapter. There, a
copy of address $a_k$ could be found in module $h(a_k)$ (mod m), while in
this chapter the set of modules that contain copies of address $h(a_k)$
was selected differently. Each address is assigned to a set of modules
containing its copies (according to the definition of the proposed
c-partitioning at the beginning of Section III 2), where no function like
the remainder mod m seems to be involved in determining any member of
this set. (For instance, module $h(a_k)$ (mod m) may not have a copy of
address $a_k$.) This is the reason why, for the purpose of combining this
stage with the previous one, we propose an alternative c-partitioning.
Its small disadvantage is that it gives slightly inferior results
compared with the first c-partitioning. Its big advantage is that it is
suited to be a second stage following Chapter II. Note that in the sequel
we always omit the hash function h and refer to the address $a_k$ only.
When the solution of this section is set to follow Chapter II, address
$a_k$ should be replaced by $h(a_k)$.

The alternative c-partitioning. For $N \le m!/(m-c)!$ and address i,
$0 \le i < N$, the first copy of i is in module i (mod m), the second
copy is in module i (mod (m-1)) of the remaining m-1 modules, and so on.
Namely, the j-th copy, $1 \le j \le c$, is in module i (mod (m-j+1)) of
the m-j+1 modules not occupied by the first j-1 copies.

Example. Let m = 10, c = 3 and i = 15. The first copy of i is in module
5 (15 (mod 10) = 5), the second is in module 7 (15 (mod 9) = 6 and, since
module 5 is occupied, module 7 corresponds to 6) and the third is in
module 9 (15 (mod 8) = 7 and modules 5 and 7 are occupied).
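
The rule above can be sketched in a few lines. This is a minimal
illustration, assuming modules are numbered from 0 (as the example
suggests); the function name is hypothetical, and the occupied modules
are kept in a plain sorted list rather than in a balanced search tree.

```python
import bisect


def alternative_copies(i, m, c):
    """Modules holding copies 1..c of address i (hypothetical helper).

    The j-th copy goes to the free module of rank i mod (m - j + 1),
    counting (from 0) only the modules not occupied by earlier copies.
    """
    occupied = []   # modules already holding a copy, in ascending order
    copies = []
    for j in range(1, c + 1):
        module = i % (m - j + 1)       # rank among the free modules
        for used in occupied:          # skip over occupied modules
            if used <= module:
                module += 1
            else:
                break
        copies.append(module)
        bisect.insort(occupied, module)
    return copies


print(alternative_copies(15, 10, 3))   # [5, 7, 9], as in the example
```

The linear scan over `occupied` costs O(c) per copy; it is used here only
to keep the sketch short.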

If $(x-1)(m!/(m-c)!) < N \le x(m!/(m-c)!)$ for some integer x, then
partition the addresses into x approximately equal subsets (layers) and
fix the c-partitioning of each of the x layers as for $N \le m!/(m-c)!$.

Remark. For an address i it takes O(c log c) time to compute the subset
of c modules containing its copies. Clearly, $i' = i$ (mod
$\lceil N(m-c)!/m! \rceil$) is the serial number of i in its layer. Find
the modules one at a time. Create a 2-3 tree for the modules that were
chosen so far. (For more on 2-3 trees, see [Aho, Hopcroft and Ullman].)
By keeping in each internal node of the tree information about the number
of unoccupied modules among its leaf-descendants, we can identify the
module of the next copy and update the tree in time O(log c). The simple
additional
details are omitted.

In the cases where $N = x\,c!\binom{m}{c}$ $(= x\,m!/(m-c)!)$ for some
integer x, Observation 2 and Theorem 1 still hold since
$\#(i_1, i_2, \ldots, i_k) = x\,c!$ for any subset
$\{i_1, i_2, \ldots, i_k\}$ of k modules in both c-partitionings (using
the notations of the lower bounds section). In order to shorten the paper
we reconstruct here only the upper bound analogous to Theorem 2. This is
done in an informal way since it follows the same lines as the proof of
Theorem 2.

For any p requests we want to find an upper bound on the best $R_{\max}$
achievable. Let x be the integer such that
$(x-1)\,m!/(m-c)! < N \le x\,m!/(m-c)!$. For some $R_{\max} = t$ and a
module assigned with t requests, denote by s the minimum number of
modules (including the first module) that have to be assigned with at
least t-1 requests in order not to enable us to decrease t in the first
module. The following two inequalities hold:

(1)  $x\,s!/(s-c)! \ge (t-1)s + 1$.

(2)  $p \ge (t-1)s + 1$.

(1) implies $x\,s^c > (t-1)s$ and $s > ((t-1)/x)^{1/(c-1)}$. This and (2)
imply

$(t-1)\,((t-1)/x)^{1/(c-1)} < p$,  $(t-1)^c < x\,p^{c-1}$  and
$t < 1 + (x\,p^{c-1})^{1/c}$.

Theorem 2'. For the alternative c-partitioning

$$R_{\max} \le 1 + \left(p^{c-1}\lceil N(m-c)!/m! \rceil\right)^{1/c}.$$
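
To get a feeling for the size of this bound, it can be evaluated
numerically. The sketch below is only an illustration; the function name
is hypothetical and `math.perm` (Python 3.8 or later) is used for
m!/(m-c)!.

```python
import math


def theorem_2prime_bound(N, m, c, p):
    """Evaluate 1 + (p^(c-1) * ceil(N (m-c)! / m!))^(1/c)."""
    per_layer = math.perm(m, c)     # m!/(m-c)! addresses per layer
    x = -(-N // per_layer)          # integer ceiling of N / per_layer
    return 1 + (p ** (c - 1) * x) ** (1.0 / c)


# One full layer (x = 1) with m = 10, c = 3 and p = 64 requests:
# the bound is 1 + (64^2)^(1/3), i.e. about 17.
print(theorem_2prime_bound(720, 10, 3, 64))
```

With one more address (N = 721) a second layer is needed, x becomes 2,
and the bound grows by the factor $2^{1/3}$ in its second term.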


The analogue to Theorem 3 will be

Theorem 3'. For the alternative c-partitioning we can achieve

$$R_{\max} \le c\left(p^{c-1}\,c!\,\lceil N(m-c)!/m! \rceil\right)^{1/c}$$

in parallel time $O(c \log c + c \log^2 p)$.

Proof. The only new idea in this proof is the following. We show that
all addresses that fall in the same layer of our alternative
c-partitioning can be efficiently further partitioned into c!
'sublayers'. Note that the m!/(m-c)! addresses of one layer (there are at
most that many) hit every subset of c modules exactly c! times, and each
time the c modules are hit in a different order of the copies. Let
$(i_1, i_2, \ldots, i_c)$ be the c modules that contain the respective
first, second, ..., c-th copy of address i that belongs to layer l, for
some i and l. Denote by N(j) the cardinality of
$\{i_k \mid k < j \text{ and } i_k < i_j\}$. Note that these
cardinalities can be observed upon searching the 2-3 tree mentioned above
for $i_j$ without changing the O(log c) time estimate. Obviously
$0 \le N(j) < j$. Define

$$L(i_1, i_2, \ldots, i_c) = \sum_{j=1}^{c} (j-1)!\,N(j)$$

to be the sublayer of address i in layer l. The only address in this
sublayer which is contained in modules $i_1, i_2, \ldots, i_c$ is i.
Therefore, the N addresses form altogether
$c!\,\lceil N(m-c)!/m! \rceil$ sublayers. Applying the lemma, similarly
to the proof of Theorem 3, to each of these sublayers completes the proof
of the theorem.
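
The sublayer index is a number in the factorial number system. The
following sketch computes it directly from the ordered tuple of modules;
the function name is hypothetical, and N(j) is counted by a plain scan
instead of the tree search mentioned in the proof.

```python
def sublayer_index(modules):
    """Return the sum over j of (j-1)! * N(j) for an ordered module tuple.

    modules[j-1] is the module holding the j-th copy; N(j) counts the
    earlier copies that sit in a smaller-numbered module.
    """
    index = 0
    factorial = 1                      # holds (j-1)! at step j
    for j in range(1, len(modules) + 1):
        current = modules[j - 1]
        n_j = sum(1 for earlier in modules[:j - 1] if earlier < current)
        index += factorial * n_j
        factorial *= j
    return index


print(sublayer_index((5, 7, 9)))       # 0!*0 + 1!*1 + 2!*2 = 5
```

Different orderings of the same c modules receive different indices in
the range 0 to c!-1, which is the property the proof relies on.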

The $O(c \log^2 p)$ size represents the computation of partial sums as in
the proof of the lemma.

III    7.      Some Nonconstructive Upper Bounds.

In this section we demonstrate a way for utilizing a number of copies c
where $\binom{m}{c} \gg N$. A typical counting argument that provides for
non-constructive c-partitionings having the desirable property of
enabling small $R_{\max}$ is presented. We leave the problem of
constructing such efficient c-partitionings open.

We  first   need   the    following   theorem: 

Theorem 4: Let p and m be as before. Let d be a positive integer, where
$p \le dm$. Assume that a c-partitioning is given such that for every
subset of size p of the N addresses, the set of modules that contain all
c copies of these p addresses is of size $\ge \lceil p/d \rceil$. Then it
is possible to get $R_{\max} \le d$ for every p memory requests.

Proof. (Similar to the proof of correctness of the alternative
sequential algorithm.) Assume, in contradiction, that there exists a set
of p ($\le dm$) requests that causes $R_{\max} > d$. By a similar
technique we choose an assignment such that some module is assigned with
$\ge d+1$ requests. They must have copies in one more module. This second
module must be assigned with $\ge d$ requests. So far we have
$\ge 2d+1$ requests. They must have copies in
$\ge 3 = \lceil (2d+1)/d \rceil$ modules. The third module is also
assigned with $\ge d$ requests, and so on. So we get that this set of
requests was of size $\ge dm+1$. A contradiction.
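
For a concrete set of requests, an assignment with $R_{\max} \le d$ (when
one exists) can be searched for with augmenting paths, in the spirit of
the argument above. The sketch below is a standard capacitated
bipartite-matching routine, not the paper's algorithm; all names are
hypothetical.

```python
def assign_with_capacity(copies, m, d):
    """Try to give every request a module so that no module gets more than d.

    copies[r] lists the modules holding a copy of the address of request r.
    Returns a dict {request: module}, or None if no such assignment exists.
    """
    held = [[] for _ in range(m)]      # requests currently held by each module

    def place(r, seen):
        for mod in copies[r]:
            if mod in seen:
                continue
            seen.add(mod)
            if len(held[mod]) < d:     # free capacity: take it
                held[mod].append(r)
                return True
            for idx in range(len(held[mod])):
                # try to move one held request elsewhere to make room
                if place(held[mod][idx], seen):
                    held[mod][idx] = r
                    return True
        return False

    for r in range(len(copies)):
        if not place(r, set()):
            return None
    return {r: mod for mod in range(m) for r in held[mod]}


print(assign_with_capacity([[0, 1], [0]], 2, 1))   # {1: 0, 0: 1}
```

Three requests whose copies all lie in the same two modules cannot be
served with d = 1, and the routine returns None in that case.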

This is actually an extension of Hall's Theorem (cf. [Even]), which is
given for the case d = 1. An upper bound on the portion of
c-partitionings that do not have the favorable property described in the
theorem is


c  -([li]-l]! 

m  ^  'd        '  (cN-ck)! 


T      =       \         r^    r    k  1  "»       d  UN-ck 


m     d 


assuming that N is divisible by m. (Note that the definition of a
c-partitioning is extended to the case where more than one copy of an
address is contained in the same module.) This upper bound is
reminiscent, to some extent, of the proof of the existence lemma for
"expander bipartite graphs" given on page 298 of [Pippenger].

If we prove for some values of N, m, p, d and c that T (the portion of
"bad" c-partitionings) is smaller than one, then we have given a
(non-constructive) existence proof of a "good" c-partitioning.

Appendix I exemplifies a result that can be derived from this upper
bound. We get that if $N = p^3$, $m = p^2$, d = 1 and p is "large enough"
then there exists a "good" 10-partitioning.

In a way similar to the considerations above we may obtain an upper
bound of 1-b on the portion of bad c-partitionings, for some positive
constant b. Since we do not have a way to construct a "good"
c-partitioning, we can choose one at random and use it until the first
time it fails. Namely, if we find for the first choice a set of requests
for which it is impossible to get $R_{\max} \le d$, then we choose
another c-partitioning, and so on. Obviously, with high probability, we
find a "good" c-partitioning after a small number of trials.

IV.      Efficient  Low-level  Simulations. 

The purpose of this paper is to present ways for simulations of shared
memory models of parallel computation which allow fairly unrestricted
access patterns of processors to shared memory cells by machines in which
the memory is organized in modules where only one cell of each module can
be accessed at a time. As was mentioned in the introduction, the paper
envisions a three stage analysis and solution to the problem. The
combination of the two earlier stages brings us to a point where each
processor of the simulating MPC machine specifies, in each cycle being
simulated, both an address request and the module that was chosen to
satisfy this request. The problem is how to complete the simulation. The
choice of the EREWPRAM and the MPC for presentation of our ideas for the
first two stages is due to the fact that the simulation of the former by
the latter distinguished both the problems and solutions. Among other
things, it was helpful to make the need for a third stage indistinct,
thereby focusing on the previous stages. This is the reason that here we
switch to other models of computation. Instead of the EREWPRAM we take a
more permissive model of computation, the concurrent-read
concurrent-write (CRCW) PRAM. In this model several processors are
allowed to read simultaneously from the same memory location. If several
processors try to write simultaneously into the same memory location, the
lowest numbered processor succeeds. This model is based on
[Goldschlager] and [Shiloach and Vishkin]. We substitute the MPC machine
by a weaker machine named non-queued (NQ)-MPC. Here only one address
request may arrive at each module at each time unit. Besides that, the
NQ-MPC is similar to the MPC.

We wish to simulate one cycle of the CRCWPRAM. Assume that each
processor of the NQ-MPC is already assigned both an address request from
the common memory and the memory module which has to satisfy this address
request. Assume also that $m \ge p$, namely the number of modules is at
least as big as the number of processors. Our proposed solution resembles
the simulation of a CRCWPRAM by an EREWPRAM in [Vishkin 83].
Unfortunately, our present problem requires the circumvention of access
conflicts not only to the same memory location but also to the same
module. Whenever the considerations are similar to this simulation we
shorten the presentation. The first step of the simulation is:

(1) Sort in parallel the p triples specified below in lexicographic
order.

Each processor enters the triple:

(the serial number of the module assigned to its address request,
the serial number of the address request itself, its own serial number).

The NQ-MPC sorts them using Batcher's sorting algorithm in time
$O(\log^2 p)$. It does not need more than size p shared memory. This is
achieved by using one cell of each module. The result is that all
requests for the same module (resp. the same address of the same module)
appear in successive locations of the sorted vector. Let us call the set
of such successive locations an interval (resp. subinterval), denoted
I(M) (resp. SI(M,a)). M and a represent indeterminates corresponding to
modules and addresses, respectively. Note that subintervals are sorted by
the serial number of the processors.

(2) The serial number of each subinterval relative to the other
subintervals in the sorted vector is computed. Denote it by #SI(M,a).

A processor is allocated to each triple. By comparison with the triple
in the preceding place of the sorted vector, the processor finds out
whether its triple is the smallest in its subinterval. If so, it is
chosen to 'represent' the subinterval, as will be seen later; let us call
the triple in this case a subinterval-triple. Each processor that is
(resp. is not) allocated to a subinterval-triple enters one (resp. zero)
to a simple $O(\log p)$ time partial-sums computation. This results in
the required subinterval serial number being associated with the
subinterval-triple.

(3) The serial number of the subinterval which is smallest in its
interval ($\min_b \#SI(M,b)$) is 'broadcast' to all other triples of the
interval.

The number $\#SI(M,a) - \min_b \#SI(M,b) + 1$ is the serial number of the
subinterval SI(M,a) relative to the other subintervals of interval I(M).
The broadcasting is done in $\lceil \log p \rceil$ pulses. In pulse i,
$1 \le i \le \lceil \log p \rceil$, each processor that already knows the
serial number of the smallest subinterval of the interval of its triple
writes it into the $2^{i-1}$-th successor of the triple in the sorted
vector if it belongs to the same interval. It is simple to see (see
[Vishkin 83]) that all triples get the appropriate message and each
module is accessed by one processor at a time.

(4)  The  processor  of  each  subinterval-triple  performs  the  requested  access 
to  the  right  module  in  time  corresponding  to  the  serial  number  computed  in 
Step  (3)   above. 

In case we simulate a writing cycle (recall the classification of the
previous chapter into reading and writing cycles) we are done. In case a
reading cycle is simulated we finish by:

(5) The content read by the processor of each subinterval-triple is
broadcast to all other triples of this subinterval.
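
Steps (1) to (4) can be sketched sequentially as follows. This is only
an illustration under stated assumptions: an ordinary sort stands in for
the parallel sorting network, `itertools.accumulate` stands in for the
parallel partial-sums computation, each pulse of the doubling broadcast
is simulated by a loop over a snapshot, and all function names are
hypothetical.

```python
from itertools import accumulate


def broadcast_by_doubling(values, group):
    """Spread the value known at the head of each group to the whole group.

    In the pulse with distance `step` (1, 2, 4, ...), every position that
    already holds a value writes it `step` places ahead if that position
    belongs to the same group.
    """
    values = list(values)
    n = len(values)
    step = 1
    while step < n:
        snapshot = list(values)        # all writes of a pulse are simultaneous
        for k in range(n - step):
            if (snapshot[k] is not None and values[k + step] is None
                    and group[k] == group[k + step]):
                values[k + step] = snapshot[k]
        step *= 2
    return values


def schedule_cycle(requests):
    """requests[q] = (module, address) of processor q.

    Returns {q: (module, address, time_slot)} for the representative
    processor q of each subinterval.
    """
    # Step (1): sort the triples (module, address, processor).
    triples = sorted((mod, addr, q) for q, (mod, addr) in enumerate(requests))
    n = len(triples)
    # Step (2): mark subinterval-triples and number them by partial sums.
    heads = [k == 0 or triples[k][:2] != triples[k - 1][:2] for k in range(n)]
    serial = list(accumulate(int(h) for h in heads))
    # Step (3): broadcast the smallest subinterval number of each interval.
    interval_head = [k == 0 or triples[k][0] != triples[k - 1][0]
                     for k in range(n)]
    known = [serial[k] if interval_head[k] else None for k in range(n)]
    first = broadcast_by_doubling(known, [t[0] for t in triples])
    # Step (4): a representative accesses its module at the relative number.
    return {q: (mod, addr, serial[k] - first[k] + 1)
            for k, (mod, addr, q) in enumerate(triples) if heads[k]}


print(schedule_cycle([(0, 7), (0, 9), (0, 7), (1, 3)]))
# {0: (0, 7, 1), 1: (0, 9, 2), 3: (1, 3, 1)}
```

In the printed schedule, processors 0 and 2 request the same address, so
only processor 0 accesses module 0 for it; no module is accessed twice in
the same time slot.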

The broadcasting is done by a technique similar to that of Step (3).
Broadcasting for the same purpose is used in [Vishkin 83]. The
$O(\log^2 p)$ time for sorting clearly dominates the time complexity of
this simulation. It is appropriate to mention at this point the new
O(log p) sorting algorithms (for p processors) given by [Ajtai, Komlos
and Szemeredi] and [Reif and Valiant]. In the first algorithm the time is
multiplied by a large constant, while the second is a probabilistic
algorithm and the constant factor is smaller.

V.      Summary

We gave above a detailed description of each component of our solution.
Here we would like to give a high level description of a simulation of
the CRCWPRAM by the NQ-MPC.

The CRCWPRAM (resp. NQ-MPC) is permissive (resp. restrictive) relative
to the spectrum of other permissive (resp. restrictive) models of
computation that appear in the literature. This enables us to go again
through the main notions of our solution in order to summarize them in a
uniform fashion.

Stage one assumes a class H of hash functions. Pick at random one of
them, say h. This hash function implies the location of the c copies of
each address $a_k$. By the alternative c-partitioning of stage two, the
j-th copy is in module $h(a_k)$ (mod m-(j-1)) of the modules that were
not occupied by the first j-1 copies ($1 \le j \le c$). The step by step
simulation starts with each processor of the NQ-MPC machine specifying an
address request for read or write like its corresponding CRCWPRAM
processor. Now we have to split into reading and writing cycles. For
reading cycles Theorem 3' gives an algorithm for allocation of address
requests to modules. A scheduling of the requests to time units of the
simulated cycle and a way to transmit the read request to the modules and
the response back to the processors are described in the previous
chapter. For writing cycles each address request is assigned to all its c
copies. We do not do worse than c times (the time for one reading cycle).
The reader is invited to verify this by observing that we can access
first copies, then second copies, and so on.

It is interesting to note that both the Ultracomputer and the PDDI
machine are able to compute partial sums in the same time it takes for
the processors to access the memory. So in both machines the term
$\log^2 p$ in the time evaluation of Theorems 3 and 3' and the preceding
lemma can be omitted. They also allow simultaneous access by several
processors to the same cell of a module. It is resolved in the same way
as in the CRCWPRAM within the same time as accessing the memory. So most
of the discussion of Chapter IV is irrelevant for both machines.

Appendix I

The following example may illustrate the results that one may expect out
of Section III 7. No special effort was made to get the best possible
result in the subsequent computation.

Example: Let $N = p^3$, $m = p^2$ and d = 1. Then,


-I    if)  it)  i'ii)^iii') 

KWp 
Since 

(cf)    >if)    if)    i'^'d'-f'/] 
we   get 

<        I        i''l'')/i^'-ilf-/) 
Since 

\k  ^  (ck)! 

and 

,(,_!)    3      2          ((c-l)pV-(c-2)k]<'^-^>'^ 
^       (c-5)k     J        ((c-2)k)! 

we   get 

V  (ck)! (cpk)*^^ 

kI,   ((c-2)k)!^(,_,)pV-(c-2)k)^-^^- 


Now  ,/^'',^\  <  (ck)2'^;  and  for  p  "large  enough"  p^  >  p^  +  (c-2)k  where 

((c-2)k)! 
c  is  a  constant  and  k  <  p.   So
KWp   ((c-2)p   )^


Call each element $L_k$. Let c = 10. For large enough p


10W!<,/p 
(8p^)^ 


Since it is easy to verify that the derivative of $L_k$ with respect to
k ($1 \le k \le p$) is negative for large enough values of p, we get
$\sum_{1 \le k \le p} L_k < 1$. So, we proved the existence of a
10-partitioning that, for these parameters, always gives a memory
contention of one.